Abstract

Motivation: Public and private repositories of experimental data are 
growing to sizes that require dedicated methods for finding relevant data. 
To improve on the state of the art of keyword searches from annotations, 
methods for content-based retrieval have been proposed. In the context 
of gene expression experiments, most methods retrieve gene expression 
profiles, requiring each experiment to be expressed as a single profile, typically
of case vs. control. A more general, recently suggested alternative
is to retrieve experiments whose models are good for modelling the query 
dataset. However, for very noisy and high-dimensional query data, this 
retrieval criterion turns out to be very noisy as well. 

Results: We propose doing retrieval using a denoised model of the query 
dataset, instead of the original noisy dataset itself. To this end, we introduce
a general probabilistic framework, where each experiment is modelled
separately and the retrieval is done by finding related models. For retrieval
of gene expression experiments, we use a probabilistic model called the product
partition model, which induces a clustering of genes that show similar
expression patterns across a number of samples. The suggested metric for
retrieval using clusterings is the normalized information distance. Empirical
results finally suggest that inference for the full probabilistic model
can be approximated with good performance using computationally faster
heuristic clustering approaches (e.g. k-means). The method is highly
scalable and straightforward to apply to construct a general-purpose gene
expression experiment retrieval method.

Availability: The method can be implemented using standard clustering
algorithms and normalized information distance, available in many
statistical software packages. 
1 Introduction


As the use of high-throughput molecular measurement technologies continues 
to spread, an ever increasing amount of data from biological experiments is 
being stored in publicly available repositories. It is then often of interest for 
researchers to retrieve experimental datasets with relevance to a given experiment,
in order to increase the power of statistical analyses and to be able to
make novel findings not obtainable from one experiment alone. The current
standard practice relies on searching for relevant experiments by keyword annotations
(e.g. Zhu et al., 2008). However, despite efforts to maintain compliance
with standard formats of documenting experiments, e.g. the MIAME standard
(Brazma, 2001), information about experiments may often be missing, insufficient
or suffer from variations in terminology (e.g. Baumgartner et al., 2007;
Schmidberger et al., 2011). In view of the challenges associated with keyword-based
retrieval, the complementary task of querying a database of experiments
using measurement data, instead of keywords, has recently received increased
attention in the literature.

Most earlier content-driven methods used for retrieval of gene expression
data represent each experiment in terms of a profile over genes, or alternatively,
over known gene sets or gene modules predicted from other data sources, see
Hunter et al. (2001); Fujibuchi et al. (2007); Caldas et al. (2009); Engreitz et al.
(2010); Georgii et al. (2012) and references therein. A representative example
is to compute differential expression profiles of case vs. control, use the correlation
between activity profiles as the measure of relevance, and retrieve the
experiments with the highest correlations (e.g. Engreitz et al., 2010). This requires
auxiliary information about the experiments, namely case and control
labels of experiment samples, and possibly additional a priori defined sets of
important genes. In the context of gene expression time series, representative
examples of retrieving gene expression profiles include Smith et al. (2008) and
Hafemeister et al. (2011).


Recently, two feasibility studies have gone beyond reducing experiments
into single profiles by using probabilistic modelling of the experiments in the
database being queried. Faisal et al. (2014) assumed that the query dataset
can be explained as a mixture of the learnt models, each model learnt from one
dataset, such that the measure of relevance is given by the inferred mixture
weights. In a slightly different approach (Seth et al., 2014), experiments were
retrieved by evaluating the posterior marginal likelihoods, given the query data,
of individual models stored for the experiments in the database.

In this paper, we introduce a method for retrieving full datasets, i.e. experiments
consisting of multiple samples, which is also based on probabilistic
modelling. However, instead of using the query dataset itself as a query, we use
a model learnt from it. The measure of relevance is therefore not a likelihood,
but instead a suitably defined metric between the models. The argument is
that for noisy and complex datasets, it is beneficial to extract relevant characteristics
of the query dataset in the same way as was done with the datasets
that are being queried. We also make explicit the importance of marginalizing
out nuisance parameters which are not directly relevant for the retrieval task.
For example, in a gene expression study, one is often more interested in how
sets of genes are co-regulated, rather than their exact expression values which
are additionally affected by numerous other influences. We tackle the specific
problem of retrieving gene expression experiments by using a product partition
model (Jordan et al., 2007) to cluster together genes that show similar expression
patterns across a number of samples. By integrating out expression levels
of the gene sets (i.e., cluster-specific information), only the co-expression patterns
revealed by the clustering structure are retained. The clustering induced
by the query dataset is then finally compared with the clusterings associated
with the experiments in the database using the normalized information distance
(Vinh et al., 2010). Notice that this approach does not involve any “training
stage”, compared to that of Seth et al. (2014), and the retrieval step does not
involve solving an optimization problem, compared to Faisal et al. (2014).

While gene clustering has a long history in characterizing gene expression
datasets (Eisen et al., 1999; D'haeseleer, 2005), it appears not to have been used
in the context of experiment retrieval before. The use of gene clustering provides
a straightforward way of characterizing each experiment with minimal preprocessing
of the data while capturing central co-expression patterns. Essentially all
previous approaches for retrieving gene expression data have converted the data
to differential expression (or gene set enrichments) requiring fixed and known
case-control distinctions. In contrast, we have only applied standard quality
control and RMA normalization steps carried out in-house at the European
Bioinformatics Institute (EBI) for datasets in the Expression Atlas database
(see Petryszak et al., 2014). Our experimental evaluation further suggests that,
for the current application, inference of the full probabilistic model can be approximated
by some computationally faster heuristic clustering algorithm, such
as k-means (see Appendix A). The computational simplicity makes the method
highly scalable and easy to apply in a black-box manner, as a general-purpose
retrieval scheme.


2 Approach 

Let $D_q$ denote a data matrix from some experiment of interest, and let $\{D_m\}_{m=1}^{M}$
be a database of $M$ datasets from previously conducted experiments. The aim
is to retrieve datasets from among the $\{D_m\}_{m=1}^{M}$ with similar characteristics as
the query dataset $D_q$. Due to the complex nature of the data, there is no single
sensible or obvious way of comparing datasets (matrices of possibly different
sizes). We propose using a model to characterize each dataset, with the aim
of reducing noise and making relevant aspects of the data more tangible, while
making the experiments comparable. The retrieval task then consists in ranking
the models $\{\mathcal{M}_m\}_{m=1}^{M}$ inferred from $\{D_m\}_{m=1}^{M}$ with respect to their similarity
with the query model $\mathcal{M}_q$ inferred from $D_q$. Note that in a broad sense, the
commonly used differential expression can be considered as one model type, and
clustering as another.
To elaborate on the above idea further, we will now assume that the data generating
mechanism of each dataset can be represented in terms of a probabilistic
model with density $f$ in some family $\{f(\cdot \mid \theta) \mid \theta \in \Theta\}$. Often, the parameter $\theta$
can be decomposed as $\theta = (\lambda, \psi)$, where $\psi$ is the parameter or characteristic of
interest (e.g., gene clusters) and $\lambda$ is a nuisance parameter (e.g., average expression
level of the gene cluster). Marginalizing out (integrating the density over)
$\lambda$ then yields a model family completely determined by $\psi \in \Psi$. Making this
operation explicit, the key quantity used in inferring a representative model for
a dataset $D$ is the marginal likelihood,

$$p(D \mid \psi) = \int_{\Lambda} f(D \mid \lambda, \psi)\, \pi_{\lambda \mid \psi}(\lambda \mid \psi)\, d\lambda, \tag{1}$$

where $\pi_{\lambda \mid \psi}(\cdot \mid \psi)$ is a prior density on $\lambda$. Ideally, we would then proceed with
a fully Bayesian approach to infer a posterior density (or distribution) $\pi_{\psi}(\cdot \mid D)$
over $\Psi$, and use $\mathcal{M} := \pi_{\psi}(\cdot \mid D)$ to characterize $D$. However, for computational
reasons we will here choose only a single element of $\Psi$ to represent $D$. Under
zero-one loss, the optimal choice is then the maximum a posteriori (MAP)
solution

$$\hat{\psi} = \arg\max_{\psi \in \Psi} \{ p(D \mid \psi)\, \pi_{\psi}(\psi) \}, \tag{2}$$

where $\pi_{\psi}$ is a prior over $\Psi$. Accordingly, we now define the representative model
for $D$ as $\mathcal{M} := \hat{\psi}$.

If a suitable function $d : \mathscr{M} \times \mathscr{M} \to \mathbb{R}$ can be defined for the pairwise
relations between the elements of the model space $\mathscr{M}$, a natural ranking among
$\mathcal{M}_1, \ldots, \mathcal{M}_M \in \mathscr{M}$ will be induced by evaluating $d(\mathcal{M}_q, \mathcal{M}_m)$ for all $m$. For
coherence of the ranking scheme, we will make a further assumption that $d$ is a
metric. That is, for all $\mathcal{M}, \mathcal{M}', \mathcal{M}'' \in \mathscr{M}$, we require that

(M1) $d(\mathcal{M}, \mathcal{M}') \geq 0$

(M2) $d(\mathcal{M}, \mathcal{M}') = 0$ if and only if $\mathcal{M} = \mathcal{M}'$

(M3) $d(\mathcal{M}, \mathcal{M}') = d(\mathcal{M}', \mathcal{M})$

(M4) $d(\mathcal{M}, \mathcal{M}'') \leq d(\mathcal{M}, \mathcal{M}') + d(\mathcal{M}', \mathcal{M}'')$.

With the above conditions satisfied, the function $d$ conforms to the intuition of
a distance, and furthermore, provides a solid foundation for the design of data
structures and algorithms, as the model space $\mathscr{M}$ forms a metric space. We
finally note that metrics are also available for probability distributions, making
the described framework applicable in cases where computational resources
allow for representing the elements of $\mathscr{M}$ as full posterior distributions.

3 Methods 

3.1 Probabilistic model for gene clustering 

The first task in constructing a retrieval scheme is to choose an appropriate
model for the experiments. While several different approaches, with varying
aims and assumptions, exist for modelling gene expression data, a particularly
simple and frequently used approach is that of gene clustering (e.g. D'haeseleer,
2005), which seeks to cluster together genes that show similar expression patterns
across a number of samples. Here, we use a probabilistic clustering approach
which simultaneously infers both the number of clusters as well as the
optimal clustering structure.

Consider first a gene expression data matrix $D$ of dimension $n \times p$, where
$n$ is the number of genes and $p$ is the number of samples. A clustering
$S = \{s_1, \ldots, s_k\}$ is a partition of the set $N = \{1, \ldots, n\}$ into $k \in \{1, \ldots, n\}$ non-empty
and non-overlapping subsets, or clusters, such that $\bigcup_{c=1}^{k} s_c = N$ and
$s_c \cap s_{c'} = \emptyset$, for $c \neq c'$. We focus here on a probabilistic formulation of clustering,
which makes explicit use of partition structures, namely the product partition
model (PPM). Technically, PPM assumes that items in the same cluster are
exchangeable and items in different clusters are independent (see Jordan et al.,
2007). Using the terminology of Section 2, the parameter of interest for this
model is the partition structure $S$, while the nuisance parameter is a vector of
cluster-specific model parameters, $\lambda = (\lambda_1, \ldots, \lambda_k)$. This leads to a marginal
likelihood of the form (see Equation (1))

$$p(D \mid S) = \int_{\Lambda} f(D \mid \lambda, S)\, \pi_{\lambda \mid S}(\lambda \mid S)\, d\lambda = \int_{\Lambda} \prod_{c=1}^{k} f(D^{(c)} \mid \lambda_c)\, \pi(\lambda_c)\, d\lambda = \prod_{c=1}^{k} p(D^{(c)} \mid s_c), \tag{3}$$

where $D^{(c)}$ denotes the subset of $D$ which is indexed by $s_c$. Note that the
assumption of independence between clusters entails constructing the marginal
likelihood as a product of cluster-specific components.

The prior distribution for $S$ will likewise be constructed as a product,

$$P(S) = K \prod_{c=1}^{k} h(s_c), \quad \text{for all } k \in \{1, \ldots, n\}, \tag{4}$$

where $K$ ensures normalization to 1 over the model space $\mathcal{S}$ and $h(s_c) \geq 0$ for
all subsets $s_c$. Note that (4) actually specifies the joint distribution for $S$ and
$k$, but since the latter is implied by the former, we omit $k$ from the notation. It
can be shown that a PPM with $K$ and $h(s_c)$ chosen such that

$$P(S) = \frac{\eta_0^{k} \prod_{c=1}^{k} (|s_c| - 1)!}{\prod_{i=1}^{n} (\eta_0 + i - 1)}, \tag{5}$$

where $|s_c|$ is the number of observations in cluster $s_c$ and $\eta_0 > 0$ controls the
tendency to form new clusters, can be obtained by integrating out the model
parameters in a Dirichlet process mixture model (Dahl, 2009).
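As an illustration of the partition prior in Equation (5), the following minimal Python sketch evaluates its logarithm from the cluster sizes alone. The function name `log_partition_prior` and its arguments are hypothetical and not part of the authors' implementation; working on the log scale is an assumption made here purely for numerical convenience.

```python
import math


def log_partition_prior(cluster_sizes, eta0=1.0):
    """Log of the partition prior in Equation (5).

    cluster_sizes: list of |s_c| for c = 1, ..., k (all >= 1).
    eta0: concentration parameter controlling the tendency to form clusters.
    """
    n = sum(cluster_sizes)
    k = len(cluster_sizes)
    # log(eta0^k * prod_c (|s_c| - 1)!), using lgamma(m) = log((m - 1)!)
    log_numerator = k * math.log(eta0) + sum(math.lgamma(m) for m in cluster_sizes)
    # log(prod_{i=1}^{n} (eta0 + i - 1))
    log_denominator = sum(math.log(eta0 + i - 1) for i in range(1, n + 1))
    return log_numerator - log_denominator
```

For three items, the five possible partitions have size profiles [3], [2, 1] (three times) and [1, 1, 1]; summing the exponentiated values over these is a quick sanity check that the prior is normalized.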

The cluster-specific marginal likelihoods $p(D^{(c)} \mid s_c)$ in Equation (3) can in
principle take any suitable form. Here, we assume that for $D^{(c)} = [x_{ij}]$, $i \in s_c$,
$j = 1, \ldots, p$, the observations in each sample $j$ are independently generated
from $N(\mu_{cj}, \tau_{cj}^{-1})$ with a conjugate NormalGamma($\mu_0, \rho_0, \alpha_0, \beta_0$) prior on the
unknown model parameters. Furthermore, we make the simplistic assumption
that the samples themselves are independent, conditional on a cluster assignment
(see Hand and Yu, 2001 for a discussion about the implications of this
assumption in a classification context). The resulting cluster-specific marginal
likelihoods may then be written as

$$p(D^{(c)} \mid s_c) = \prod_{j=1}^{p} (2\pi)^{-|s_c|/2} \left(\frac{\rho_0}{\rho_j}\right)^{1/2} \frac{\Gamma(\alpha_j)}{\Gamma(\alpha_0)} \frac{\beta_0^{\alpha_0}}{\beta_j^{\alpha_j}}, \tag{6}$$

where

$$\rho_j = \rho_0 + |s_c|, \quad \alpha_j = \alpha_0 + \frac{|s_c|}{2}, \quad \bar{x}_j = \frac{1}{|s_c|} \sum_{i \in s_c} x_{ij},$$

$$\beta_j = \beta_0 + \frac{1}{2} \sum_{i \in s_c} (x_{ij} - \bar{x}_j)^2 + \frac{|s_c|\, \rho_0\, (\bar{x}_j - \mu_0)^2}{2 \rho_j}.$$


Blomstedt et al. (2015) introduced a PPM for clustering mixed discrete
and continuous data, where the continuous component was of form (6). Following
their implementation, we normalize each column of the data matrix
$D = [x_{ij}]$ to have zero mean and unit variance, and set the hyperparameter
values to $\mu_0 = 0$ and $\rho_0 = \alpha_0 = \beta_0 = 1$. Furthermore, the model is
equipped with a prior of the form (5), with $\eta_0 = 1$. Finally, combining Equations
(3)-(6), an optimal clustering $\hat{S}$ w.r.t. a dataset $D$ is given by the MAP
solution (see Equation (2))

$$\hat{S} = \arg\max_{S \in \mathcal{S}} \{ p(D \mid S)\, P(S) \}. \tag{7}$$
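The cluster-specific term entering Equation (7) can be sketched in a few lines of Python. The block below is a minimal illustration of the standard Normal-Gamma marginal likelihood with the hyperparameter defaults stated above; the function name `log_cluster_marginal` and the list-of-rows input format are assumptions made for this example, not the authors' code.

```python
import math


def log_cluster_marginal(rows, mu0=0.0, rho0=1.0, alpha0=1.0, beta0=1.0):
    """Log marginal likelihood of one cluster, summed over independent samples.

    rows: expression vectors of the genes in the cluster, one list of
          length p per gene (so len(rows) == |s_c|).
    """
    size = len(rows)          # |s_c|
    n_samples = len(rows[0])  # p
    rho = rho0 + size
    alpha = alpha0 + size / 2.0
    total = 0.0
    for j in range(n_samples):
        column = [row[j] for row in rows]
        mean = sum(column) / size
        sum_sq = sum((x - mean) ** 2 for x in column)
        beta = beta0 + 0.5 * sum_sq + size * rho0 * (mean - mu0) ** 2 / (2.0 * rho)
        total += (
            -0.5 * size * math.log(2.0 * math.pi)
            + 0.5 * (math.log(rho0) - math.log(rho))
            + math.lgamma(alpha) - math.lgamma(alpha0)
            + alpha0 * math.log(beta0) - alpha * math.log(beta)
        )
    return total
```

Summing this quantity over the clusters of a candidate partition and adding the log prior gives the log of the objective maximized in Equation (7), up to the choice of search procedure.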


3.1.1 Inference 


To find the optimal clustering $\hat{S} \in \mathcal{S}$ as defined in Equation (7), we use a
stochastic greedy search algorithm, which moves in the model space by successive
application of move, split and merge operators; for further details, see
Blomstedt et al. (2015). While being more efficient for the optimization task
than standard Markov chain Monte Carlo methods, for large amounts of data
the algorithm still requires a considerable amount of computation time. To
that end, some computational simplifications based on heuristic clustering procedures
will be discussed in Appendix A.


3.2 Distance metric for clusterings 

Assuming now that each of the experiments in a database has been represented
with a clustering $S \in \mathcal{S}$, the remaining task is to find a function $d$ which can
be defined on $\mathcal{S}$ and satisfies conditions (M1)-(M4) above. In recent years, a
new generation of information-theoretic distance measures has emerged (see e.g.
Meila, 2007; Vinh et al., 2010), which possess many desirable properties, such
as the metric property, and which have been employed because of their strong
mathematical foundation and ability to detect non-linear similarities.

Vinh et al. (2010) conducted a systematic comparison of information-theoretic
distance measures, concluding that the preferred “general-purpose” measure
for comparing clusterings is the normalized information distance, later denoted
$d_{\mathrm{NID}}$. To give a definition of this measure, we first introduce some notation.
Briefly, for two clusterings $S$ and $S'$, the number of items co-occurring in clusters
$s_c \in S$ and $s'_{c'} \in S'$ is given by $n_{cc'} = |s_c \cap s'_{c'}|$, with $\sum_{c=1}^{k} \sum_{c'=1}^{k'} n_{cc'} = n$.
The marginal sums are denoted by $n_{c\cdot} = \sum_{c'=1}^{k'} n_{cc'}$ and $n_{\cdot c'} = \sum_{c=1}^{k} n_{cc'}$. A
key realization in the derivation of information-theoretic distance measures is
that each clustering induces an empirical probability distribution over the set
$\{1, \ldots, k\}$, such that the probability of a randomly chosen item $i \in N$ being
in cluster $s_c$ is given by $P(i \in s_c) = n_{c\cdot}/n$. Similarly, the joint probability
of the pair $(i, j) \in N \times N$ co-occurring in clusters $s_c$ and $s'_{c'}$ is given by
$P((i, j) \in s_c \times s'_{c'}) = n_{cc'}/n$. The entropy of a clustering $S$, describing the
uncertainty associated with assigning items into the clusters of $S$, is then formulated
as

$$H(S) = -\sum_{c=1}^{k} P(i \in s_c) \log P(i \in s_c).$$

The mutual information of clusterings $S$ and $S'$, which measures how much
having knowledge of $S'$ reduces $H(S)$ (or vice versa), is further defined as

$$I(S, S') = \sum_{c=1}^{k} \sum_{c'=1}^{k'} P((i, j) \in s_c \times s'_{c'}) \log \frac{P((i, j) \in s_c \times s'_{c'})}{P(i \in s_c)\, P(j \in s'_{c'})}.$$

It can also be interpreted as a measure of dependence in the sense that if $S$ and
$S'$ are independent, then $I(S, S') = 0$. Finally, from the above quantities we
obtain $d_{\mathrm{NID}}$ as

$$d_{\mathrm{NID}}(S, S') = 1 - \frac{I(S, S')}{\max\{H(S), H(S')\}}. \tag{8}$$
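The definitions leading up to Equation (8) translate directly into code. The following Python sketch computes the normalized information distance between two clusterings given as label vectors over the same items; the function names and the label-vector representation are choices made for this illustration. Returning 0 when both entropies are zero (both clusterings consist of a single cluster) is an assumed convention, since Equation (8) is undefined in that case.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy H(S) of a clustering given as one cluster label per item."""
    n = len(labels)
    return -sum((m / n) * math.log(m / n) for m in Counter(labels).values())


def mutual_information(labels_a, labels_b):
    """Mutual information I(S, S') from the contingency counts n_cc'."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    return sum(
        (n_ab / n) * math.log(n_ab * n / (counts_a[a] * counts_b[b]))
        for (a, b), n_ab in joint.items()
    )


def nid(labels_a, labels_b):
    """Normalized information distance of Equation (8)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    h_max = max(entropy(labels_a), entropy(labels_b))
    if h_max == 0.0:
        # Assumed convention: two single-cluster partitions are identical.
        return 0.0
    return 1.0 - mutual_information(labels_a, labels_b) / h_max
```

Because only co-occurrence counts enter the computation, the distance is invariant to how the clusters are named, so clusterings learnt independently for different experiments can be compared directly as long as they are defined over the same set of genes.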


4 Results 


4.1 Data and experimental setup 

To evaluate the modelling-based retrieval scheme developed in Sections 2 and 3,
we used as a starting point all differential expression experiments conducted on
the A-AFFY-44 Affymetrix GeneChip available in Expression Atlas (EA; http://www.ebi.ac.uk/gxa,
see Petryszak et al., 2014) as of 4-Jun-2014. Only experiments with both measurement
data and analytics data available were considered. Furthermore, experiments
with a very small number of genes were discarded. Since most experiments
had expression measurements for more than 54 670 genes, this number
was set as the lower limit. Based on the above selection process we obtained
an initial set of 447 experiments. In a second stage, we selected a subset of
these experiments based on the availability of experimental factor ontologies
(EFO; http://www.ebi.ac.uk/efo/, see Malone et al., 2010), which were used
as ground truth in the evaluation. More specifically, we retained those experiments
which had at least one of the EFO types “cell type”, “disease” or “organism
part” present. Moreover, experiments having multiple values for a given
EFO type were excluded, and finally only experiments with the same EFO value
present in at least two experiments were included in this study, resulting in a
final set of 251 experiments (for a list of accession numbers, see Appendix C).
The number of samples per experiment varied between 6 and 353, the median
number of samples being 22.

Out of the final set of 251 experiments, three partly overlapping subsets
corresponding to each of the EFO types were formed. These consisted of 103
experiments with values recorded for “cell type”, 76 with values for “disease”
and 174 with values for “organism part”. The numbers of different EFO values
in these sets of experiments were 23, 19 and 32, respectively. In retrieving full
experiments, those experiments having the same EFO value were considered
relevant, and other experiments irrelevant. Note that the above EFO types
were not the main conditions of interest on which differential gene expression
had been studied in the experiments, but were chosen to give a more general
description of the experiments. A more complete ground truth was not readily
available as most other EFO types were only present in small subsets of the
experiments. Retrieval performance was measured using precision and recall,
taken as an average of successively using each of the experiments as a query to
retrieve among the remaining experiments.
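The leave-one-out evaluation described above can be sketched as follows. This is a minimal illustration rather than the evaluation code used in the study: the function names, the nested-list distance matrix and the cut-off parameter `k` are assumptions introduced here. It relies on every query having at least one relevant experiment, which holds in this setup because each EFO value is present in at least two experiments.

```python
def precision_recall_at_k(distances, labels, query, k):
    """Precision and recall among the k nearest experiments to one query.

    distances: M x M nested list, distances[q][m] = distance between models.
    labels: EFO value of each experiment (the ground truth).
    """
    candidates = [m for m in range(len(labels)) if m != query]
    ranked = sorted(candidates, key=lambda m: distances[query][m])
    relevant = {m for m in candidates if labels[m] == labels[query]}
    hits = sum(1 for m in ranked[:k] if m in relevant)
    return hits / k, hits / len(relevant)


def mean_precision_recall_at_k(distances, labels, k):
    """Average precision and recall, using each experiment in turn as query."""
    results = [
        precision_recall_at_k(distances, labels, q, k) for q in range(len(labels))
    ]
    mean_precision = sum(p for p, _ in results) / len(results)
    mean_recall = sum(r for _, r in results) / len(results)
    return mean_precision, mean_recall
```

Evaluating the averages for k = 1, 2, ..., M - 1 traces out a precision-recall curve of the kind reported in the comparison of retrieval schemes.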

In order to reduce the number of genes for clustering, we initially selected
for each of the 251 experiments the top 5 genes resulting from a ‘non-specific’
search in EA, in which genes with the highest absolute values of t-statistics in
any available contrast come first, irrespective of whether they are reported with
high t-statistics in the remaining contrasts (for further details about listing genes
in EA, see Petryszak et al., 2014). Finally, by taking the union of these genes
over all experiments, we arrived at 1125 genes per experiment. The selection
process per se is not an essential part of our approach but is done for computational
convenience only. In a preliminary stage of our analyses, we experimented with
different numbers of genes but found that this only had a minor impact on the
results; see Appendix B for further details.


4.2 Comparison of retrieval schemes 

We will now proceed to evaluating the performance of the retrieval approach
proposed above. For gene expression data, we learn for each experiment
a Gaussian product partition model (PPM) which implies a clustering over
genes, see Section 3. The clustering $S_q$ learned from the query data is then
related to the clusterings $S_1, \ldots, S_M$ by evaluating the distances $d_{\mathrm{NID}}(S_q, S_m)$,
$m = 1, \ldots, M$, see Equation (8). This approach will be contrasted with two
alternative approaches for content-based retrieval previously suggested in the
literature. The first one of these is closely related to the proposed approach in
that it learns a PPM for each experiment in the database. However, instead of
evaluating distances, it evaluates the marginal likelihoods $p(D_q \mid S_m)$ of the learnt
models, given the query dataset. A higher likelihood is then an indication of a
higher relevance to the query dataset. A similar approach, albeit for a different
model family, was recently suggested in Seth et al. (2014). The term “modelling-based
retrieval” has previously been used by Faisal et al. (2014) to describe an
approach based on probabilistic modelling but using a likelihood as the measure
of relevance. To make a distinction between the approach proposed here and
approaches based on evaluating likelihoods, we will in this comparison refer to
the former as model-distance-based retrieval and the latter as likelihood-based
retrieval. See Section 5 for a further discussion about the differences between
the two approaches.

The second alternative approach, differential expression based retrieval, assumes
that a statistical test to detect differentially expressed genes has been
conducted beforehand. The method is then based on correlating the gene-specific
differential expression p-values of the query experiment with those of the
database experiments. An approach similar to this was suggested by Engreitz et al.
(2010). If targeted at differential expression profiles obtained under specific conditions
known to be important, this scheme has much potential to achieve good
retrieval performance. On the other hand, it assumes more background knowledge
and preprocessing of the data than the suggested retrieval schemes based
on gene clustering. Here, we do not assume a specific condition of interest but
choose in each experiment for the selected 1125 genes the smallest p-values under
any of the conditions tested and reported in Expression Atlas. We also
experimented with a much larger set of 40 569 genes, constituting the maximal
common set of genes tested in all experiments, but this resulted in slightly
inferior performance. The correlation measure used was Pearson’s correlation.
We finally note that differential expression based retrieval schemes can also be
formulated under the general framework of Section 2 using some appropriate
probabilistic model for differential expression, as formulated in e.g. Do et al.
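The differential expression based baseline can be illustrated with a short Python sketch that ranks database experiments by the Pearson correlation between their gene-specific p-value vectors and that of the query. The function names and the list-based input format are hypothetical, and the sketch assumes that all vectors cover the same genes in the same order and are not constant.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rank_by_correlation(query_pvalues, database_pvalues):
    """Indices of database experiments, most correlated with the query first."""
    scores = [pearson(query_pvalues, p) for p in database_pvalues]
    return sorted(range(len(scores)), key=lambda m: scores[m], reverse=True)
```

In contrast to the clustering-based schemes, the input here already presupposes that a differential expression analysis has been run for every experiment.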


The results of the comparison between the retrieval schemes are shown in
Figure 1. Here, the model-distance-based retrieval scheme clearly outperforms
the two other schemes. A notable feature of the results is the surprisingly poor
performance of the likelihood-based approach. This may be due to the well-known
fact that gene expression measurements tend to be extremely noisy. In
essence, the marginal likelihood $p(D_q \mid S_m)$ measures how well the query dataset
$D_q$ is predicted by a model $S_m$, learnt from dataset $D_m$. Even if experiments
$q$ and $m$ are in some way related, the idealized model $S_m$ may still not provide
a good prediction for data $D_q$. Therefore, instead of using the complex and
possibly very noisy dataset $D_q$ as query input, retaining only the characteristics
relevant for retrieval in both $D_q$ and $D_m$ may help to improve performance, as
illustrated in the results.

(c) Organism part

Figure 1: Precision-recall curves comparing model-distance-based, likelihood-based,
and differential expression (DE) based retrieval using three EFO types
(a-c) as ground truth.


4.3 Biological information in gene clustering

Any single EFO type will necessarily capture only one aspect of an experiment, 
whereas a meaningful retrieval task usually involves an evaluation of relevance 
between experiments in terms of a combination of aspects. It is therefore of 
interest to study the effect of composing the ground truth as a combination 
of multiple EFO types. In the current experimental setup, the ground truth 
for each of the EFO types “cell type”, “disease” and “organism part”, can be 
represented as a symmetric binary matrix $G$ of dimension $M \times M$, such that
entry $g_{ij} = 1$ iff experiments $i$ and $j$ are mutually relevant. A ground truth
which requires a match in $t$ EFO types can then be formed by summing the
three matrices and requiring $g_{ij} = t$.
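The construction of a combined ground truth can be sketched as below. The helper names are hypothetical; `None` is used here to mark experiments without a value for an EFO type, and the diagonal is set to zero on the assumption that an experiment is never retrieved for itself. The threshold is implemented as "at least t matches", which covers the cases considered in the evaluation (with three EFO types, "at least 3" coincides with "exactly 3").

```python
def relevance_matrix(values):
    """Binary M x M ground truth for one EFO type.

    values[i] is the EFO value of experiment i, or None if not annotated.
    """
    m = len(values)
    return [
        [
            int(i != j and values[i] is not None and values[i] == values[j])
            for j in range(m)
        ]
        for i in range(m)
    ]


def combine_ground_truths(matrices, t):
    """Entry is 1 iff at least t of the given matrices mark the pair relevant."""
    m = len(matrices[0])
    return [
        [int(sum(g[i][j] for g in matrices) >= t) for j in range(m)]
        for i in range(m)
    ]
```

For example, with three experiments annotated for cell type, disease and organism part, `combine_ground_truths([g_cell, g_disease, g_organ], 2)` marks only those pairs that agree on two or more of the three types.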

In Figure 2, the model-distance-based retrieval scheme is evaluated against
ground truth relevances requiring (a) any EFO type to match ($t \geq 1$), (b) two or
more matches ($t \geq 2$) and (c) all EFO types to match ($t = 3$). The numbers of experiments
satisfying these conditions are 251, 54 and 6, respectively. Intuitively,
the ground truth can be considered increasingly informative as the number of
matching EFO types required to declare relevance increases. A retrieval scheme
capturing biologically relevant information should then be in better agreement
with a more informative ground truth. Although the curves of Figures 2a and
2b are not directly comparable due to the differing number of experiments used,
the shape of the latter gives an indication of a better agreement. In Figure
2c, owing to the small number of available experiments, the ground truth is
compared with the single most relevant experiment (out of five possible ones)
retrieved for each query. Here, the retrieval result matches the ground truth in
four of the six queries.


4.4 Annotations and gene clustering combined 

As noted previously in Section 1, information about experiments may often be
missing, insufficient or suffer from variations in terminology (Baumgartner et al.,
2007; Schmidberger et al., 2011) despite a formal declaration of compliance with
MIAME criteria (Brazma, 2001). Hence, even in cases where keyword-based retrieval
is of primary interest, it may be advantageous to complement a query
with information provided by gene clustering. A straightforward way of combining
these two types of information is the following. Assume that a database
of $M$ experiments is being queried and that $L < M$ experiments are found
to match the keyword query. More formally, the result can be encoded as a
binary vector of length $M$ with $L$ elements having value 1. A model-distance-based
retrieval scheme, on the other hand, will return a vector of length $M$ with
each element representing the distance of the corresponding experiment-specific
model to the query model. Element-wise multiplication of these vectors then
effectively induces a ranking of the experiments retrieved in the keyword-based
query. The underlying idea is that this ranking will reflect some information
which is not present in the queried keyword(s) alone.
Figure 2: Evaluation of model-distance-based retrieval scheme with respect to
a ground truth requiring (a) at least one, (b) at least two, (c) exactly three
matching EFO types. The rightmost subfigure compares the ground truth matrix
(hollow squares) with the single most relevant retrieved experiment per
query (solid squares) for the six experiments having a simultaneous match in
all three EFO types. Accession numbers for the experiments are provided as a
reference.

To test the combined method, we considered all experiments matching in
both “cell type” and “organism part”, resulting in a total of 43 experiments
(all other combinations of two EFO types resulted in significantly fewer experiments).
A match in both of these EFO types was used as ground truth. The
idea was then to retrieve experiments assuming only one of the EFO types to
be known, complementing keyword-based retrieval with rankings from model-distance-based
retrieval. Retrieving experiments assuming only “cell type” to
be known resulted in an average precision of 0.55 for keyword-based retrieval
and 0.61 (mean average precision) for combined retrieval, the corresponding
numbers being 0.81 and 0.84, respectively, when only “organism part” was assumed
to be known. In both cases we were able to see a slight improvement
in performance for the combined approach, suggesting that keyword-based retrieval
may benefit from being complemented with auxiliary information, such
as gene clustering.


5 Discussion 


In this paper, we have introduced a general probabilistic framework for content-driven
retrieval of experimental datasets. Compared to earlier works which
also employ probabilistic modelling (e.g. Caldas et al., 2009, 2012; Faisal et al.,
2014; Seth et al., 2014), we do not use the likelihood of the query data as a
measure of relevance, but instead learn a model of the query data and compare
models. We argue that this reduces noise in the query input. With nuisance parameters
further marginalized out, only characteristics relevant for the retrieval
task are retained. A special instance of the general framework introduced in
this paper has been previously used as a comparative method in a simulation
study (Seth et al., 2014) with performance slightly inferior to a likelihood-based
approach. The simulation setting in that earlier study was, however, very simplistic
compared to datasets encountered in many real-life scenarios, such as
that of Section 4, where the model-distance-based approach was now seen to
clearly outperform its likelihood-based counterpart.

Contrary to likelihood-based approaches, the model-distance-based approach 
requires all models under consideration to belong to the same family. Although 
this may seem somewhat restrictive, in particular for the potential future scenario
in which individual researchers independently store models in a repository
along with their datasets (e.g. Faisal et al., 2014), there are also scenarios where
the assumption is feasible. Datasets which arise as a result of some specific type 
of experiment are often in practice modelled using a fairly standardized set of 
approaches. In particular, if the models are constructed automatically, or by 
a curator of a data repository, the assumption of the models belonging to the 
same family is feasible. 

As a specific application of the general framework, in Sections 3 and 4 we
proposed a retrieval scheme for gene expression experiments based on gene
clustering. It turned out that clustering is a surprisingly good model for this
purpose: with minimal preprocessing and prior knowledge about the experiments,
it is able to yield reasonable retrieval performance (Section 4.2) and to
capture biologically relevant characteristics of the experiments (Section 4.3).
Finally, we showed that it is straightforward to combine model-distance-based
(or any modelling-based) retrieval with retrieval using available keywords
(Section 4.4).
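
As a minimal illustration of comparing models rather than data, the sketch below computes the normalized information distance between two clusterings of the same genes, using the common form 1 - I(U;V) / max{H(U), H(V)}, and ranks stored clusterings by their distance to a query clustering. The function names, the experiment labels and the toy label vectors are assumptions made for illustration only; they are not taken from the analyses reported in this paper.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy of a clustering given as a list of cluster labels."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def mutual_information(u, v):
    """Mutual information between two clusterings of the same items."""
    n = len(u)
    cu, cv, cuv = Counter(u), Counter(v), Counter(zip(u, v))
    return sum(c / n * log(c * n / (cu[a] * cv[b])) for (a, b), c in cuv.items())


def nid(u, v):
    """Normalized information distance: 1 - I(U;V) / max(H(U), H(V))."""
    h = max(entropy(u), entropy(v))
    if h == 0.0:
        return 0.0  # both clusterings put every item in a single cluster
    return 1.0 - mutual_information(u, v) / h


# Hypothetical repository of gene clusterings, one label per gene.
repository = {"exp_a": [1, 1, 0, 0, 2, 2], "exp_b": [0, 1, 0, 1, 0, 1]}
query = [0, 0, 1, 1, 2, 2]

# Smaller distance means more relevant to the query.
ranking = sorted(repository, key=lambda name: nid(query, repository[name]))
```

Because the distance depends only on how the genes are grouped, relabelling the clusters of either clustering leaves the ranking unchanged.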
Appendix


A Simplified search for an optimal clustering 


Recall that a product partition model (PPM) is a probabilistic model which
implies a clustering $S = \{s_1, \ldots, s_k\}$ of $n$ data items into
$k \le n$ non-empty and non-overlapping subsets. Given a dataset $D$, an
optimal clustering $\hat{S}$ is given by the maximum a posteriori solution

$$\hat{S} = \arg\max_{S' \in \mathcal{S}} \{ p(D \mid S') P(S') \},$$

where $\mathcal{S}$ denotes the space of all possible clusterings of $D$. Since
the cardinality $|\mathcal{S}|$ of the model space grows very quickly with $n$,
an exhaustive evaluation of all posterior probabilities
$P(S' \mid D) \propto p(D \mid S') P(S')$, $S' \in \mathcal{S}$, is not
feasible in practice (for instance, with $n = 50$, we already have
$|\mathcal{S}| = 1.8572 \times 10^{47}$). Therefore a
stochastic greedy search algorithm was implemented in the analyses of Section 
4 to find the optimal clustering for each dataset. While being more efficient 
for the optimization task than standard Markov chain Monte Carlo methods, 
for large amounts of data the algorithm still requires a considerable amount of 
computation time. 
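
The growth of the model space can be checked directly, since the number of clusterings of n labelled items is the Bell number B_n. The following sketch computes it with the Bell triangle recurrence; the function name is illustrative.

```python
def bell_number(n):
    """Number of ways to partition n labelled items into non-empty subsets."""
    row = [1]  # row 0 of the Bell triangle; its first entry is B_0
    for _ in range(n):
        # Each new row starts with the last entry of the previous row, and
        # every further entry adds the entry above-left to its left neighbour.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]


size_50 = bell_number(50)
print(f"{size_50:.4e}")  # 1.8572e+47
```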

One possible simplification is to restrict the search to a subset
$\mathcal{S}^* \subset \mathcal{S}$ of the model space by choosing a set of
potentially good solutions in advance, and then selecting the optimal solution
among them as

$$\hat{S}^* = \arg\max_{S \in \mathcal{S}^*} \{ p(D \mid S) P(S) \}. \tag{9}$$


A straightforward way of finding a suitable $\mathcal{S}^*$ is to only consider
solutions found by one or several different heuristic clustering algorithms.
These algorithms are usually fast to execute but provide no measure of
uncertainty regarding the obtained solution and require the number of clusters
$k$ to be fixed in advance. Running such an algorithm for all values of
$k \in \{1, \ldots, n\}$ will reduce the cardinality of the search space to
$|\mathcal{S}^*| = n$, which in many cases is small enough to enable an
exhaustive evaluation of the posterior probabilities of all clusterings in
$\mathcal{S}^*$. Even a combination of, say, $L$ different algorithms still
yields a model space with a cardinality of only $|\mathcal{S}^*| = L \cdot n$.
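
The restricted search can be sketched as follows: generate one candidate clustering per value of k with a fast heuristic, score every candidate, and keep the best. In this toy version both ingredients are stand-ins and are assumptions of the example: `gap_clustering` cuts one-dimensional data at its widest gaps in place of a heuristic such as k-means, and `toy_log_score` is a Gaussian fit with a fixed cost per cluster in place of the PPM quantity log p(D|S) + log P(S). Neither is the model used in the analyses.

```python
from math import log, pi


def gap_clustering(data, k):
    """Stand-in heuristic: cut the sorted 1-D data at its k-1 widest gaps."""
    order = sorted(range(len(data)), key=lambda i: data[i])
    widest = sorted(range(1, len(order)),
                    key=lambda j: data[order[j]] - data[order[j - 1]],
                    reverse=True)[:k - 1]
    cuts = set(widest)
    labels, current = [0] * len(data), 0
    for position, index in enumerate(order):
        if position in cuts:
            current += 1  # start a new cluster after a selected gap
        labels[index] = current
    return labels


def toy_log_score(data, labels, sigma=1.0, penalty=3.0):
    """Stand-in for log p(D|S) + log P(S); NOT the PPM of the paper."""
    clusters = {}
    for x, c in zip(data, labels):
        clusters.setdefault(c, []).append(x)
    score = -penalty * len(clusters)  # crude prior favouring fewer clusters
    for xs in clusters.values():
        mean = sum(xs) / len(xs)
        score += sum(-0.5 * ((x - mean) / sigma) ** 2
                     - 0.5 * log(2 * pi * sigma ** 2) for x in xs)
    return score


def restricted_map(data, ks):
    """Best-scoring clustering within the reduced space of heuristic solutions."""
    candidates = [gap_clustering(data, k) for k in ks]  # one candidate per k
    return max(candidates, key=lambda s: toy_log_score(data, s))


data = [0.1, 0.3, 0.2, 5.0, 5.2, 9.9, 10.1]
best = restricted_map(data, range(1, len(data) + 1))
```

Only n candidates are scored here, instead of all possible clusterings of the n items.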

To further reduce the scope of the search, the range of $k$ for which heuristic
solutions are obtained may be restricted to an interval in which plausible
solutions are likely to be found. For instance, in analysing how different
distances and clustering methods interact regarding their ability to cluster
gene expression, Jaskowiak et al. (2014) conducted a comparison for clusterings
generated in the interval $k \in \{2, \ldots, \lceil \sqrt{n} \rceil\}$, rather
than the full range of values for $k$. In our current application, we
additionally experimented with restricting $k$ to a fixed value, which
trivially reduces the model space to a single clustering. In this case, as the
number of clusters is not chosen adaptively for each dataset, the clusterings
no longer provide biologically meaningful groupings of the genes but may still
give a sufficient characterization of the experiments for purposes of
retrieval. This is demonstrated in Figure 3, where retrieval based on the
optimal clustering in the full model space $\mathcal{S}$ is compared with that
in a reduced model space $\mathcal{S}^*$ of k-means solutions in
$k \in \{2, \ldots, \lceil \sqrt{n} \rceil\}$, as well as a trivial model space
$\mathcal{S}_0$, consisting of only one k-means solution with
$k = \lceil \sqrt{n}/2 \rceil$, corresponding to the midpoint of the interval
used by Jaskowiak et al. (2014).
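
As a small worked example of these two restrictions, the sketch below lists the candidate values of k and the single fixed value for a given number of genes n, taking the upper bound as the ceiling of the square root of n and the fixed value as the ceiling of half that square root. The function names are illustrative, and n = 1125 is used because it is the number of genes per experiment reported in Appendix B.

```python
from math import ceil, sqrt


def k_interval(n):
    """Candidate cluster counts k in {2, ..., ceil(sqrt(n))}."""
    return range(2, ceil(sqrt(n)) + 1)


def k_midpoint(n):
    """Single fixed cluster count k = ceil(sqrt(n) / 2)."""
    return ceil(sqrt(n) / 2)


n_genes = 1125
candidates = k_interval(n_genes)   # 2, 3, ..., 34
fixed_k = k_midpoint(n_genes)      # 17
```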

The quality of the solution in (9) depends on how well the clusterings in
$\mathcal{S}^*$ (or $\mathcal{S}_0$) correspond to those clusterings in
$\mathcal{S}$ which have a high probability under the PPM formulation. Figure 4
shows a comparison of the retrieval performance of various heuristic clustering
algorithms, with the number of clusters fixed for simplicity at
$k = \lceil \sqrt{n}/2 \rceil$, and using PPM as baseline. The results indicate
that heuristic algorithms which are based on a Euclidean distance measure
(e.g. k-means with squared Euclidean distance and complete linkage (CL) with
Euclidean distance) yield retrieval performance which most closely matches that
obtained using the Gaussian PPM. Although similar behaviour may be expected
in other datasets of the same type, the conclusion is data-specific and
should not be generalized beyond the scope of the current data without further
validation.

Figure 3: Retrieval performance using clusterings found in full ($\mathcal{S}$),
reduced ($\mathcal{S}^*$), and trivial ($\mathcal{S}_0$) model spaces. Panels:
(a) Cell type, (b) Disease, (c) Organism part; horizontal axis: recall.

Figure 4: Retrieval performance for various heuristic clustering approaches,
using PPM as baseline.

B Impact of number of genes


In Section 4.1, the number of genes for clustering was reduced by initially
selecting for each experiment the top 5 genes resulting from a ‘non-specific’
search in Expression Atlas (http://www.ebi.ac.uk/gxa; see Petryszak et al.,
2014). Taking the union of these genes over all 251 experiments finally
resulted in 1125 genes per experiment. To study the impact of the number of
genes included in each dataset, we repeated the same procedure for the top 10
and top 25 genes, resulting in 2117 and 4740 genes per experiment,
respectively. Due to the large number of genes, in particular in the last
group, a simplified search scheme for clusterings was employed as described in
the previous section, using k-means with squared Euclidean distance measure and
$k \in \{2, \ldots, \lceil \sqrt{n} \rceil\}$. Figure 5 suggests that the
number of genes chosen only has a minor impact on retrieval performance.
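
The gene selection step described above amounts to a set union over per-experiment ranked lists. A minimal sketch, assuming each experiment comes with a list of gene identifiers ordered by rank, is given below; the function name, the experiment keys and the gene identifiers are hypothetical placeholders rather than actual Expression Atlas output.

```python
def gene_union(ranked_genes_per_experiment, m):
    """Union of the top-m genes of every experiment's ranked gene list."""
    selected = set()
    for ranked in ranked_genes_per_experiment.values():
        selected.update(ranked[:m])  # keep only the m highest-ranked genes
    return sorted(selected)


# Hypothetical ranked gene lists for two experiments.
toy_rankings = {
    "experiment_1": ["g1", "g2", "g3"],
    "experiment_2": ["g2", "g4", "g5"],
}
shared_gene_set = gene_union(toy_rankings, m=2)
```

Every experiment is then restricted to the same gene set, so that the resulting clusterings are defined over identical items and can be compared.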

Figure 5: Retrieval performance for different numbers of genes included for
clustering. Panels: (a) Cell type, (b) Disease, (c) Organism part.

C Experiment accession numbers

Accession numbers of the 251 experiments selected for the analyses: 

E-GEOD-10070, E-GEOD-10233, E-GEOD-10289, E-GEOD-10311, E-GEOD-10315,
E-GEOD-10595, E-GEOD-10696, E-GEOD-10718, E-GEOD-10780, E-GEOD-10799,
E-GEOD-10820, E-GEOD-10821, E-GEOD-10831, E-GEOD-10879, E-GEOD-10890,
E-GEOD-10896, E-GEOD-10916, E-GEOD-10971, E-GEOD-10979, E-GEOD-11057,
E-GEOD-11166, E-GEOD-11199, E-GEOD-11281, E-GEOD-11309, E-GEOD-11324,
E-GEOD-11348, E-GEOD-11352, E-GEOD-11428, E-GEOD-11755, E-GEOD-11761,
E-GEOD-11783, E-GEOD-11839, E-GEOD-11886, E-GEOD-11919, E-GEOD-11941,
E-GEOD-11959, E-GEOD-12034, E-GEOD-12108, E-GEOD-12113, E-GEOD-12121,
E-GEOD-12172, E-GEOD-12251, E-GEOD-12254, E-GEOD-12264, E-GEOD-12265,
E-GEOD-12287, E-GEOD-12355, E-GEOD-12408, E-GEOD-12452, E-GEOD-12710,
E-GEOD-13487, E-GEOD-13501, E-GEOD-13548, E-GEOD-13637, E-GEOD-13762,
E-GEOD-13763, E-GEOD-13837, E-GEOD-13899, E-GEOD-13911, E-GEOD-13975,
E-GEOD-13987, E-GEOD-14001, E-GEOD-14017, E-GEOD-14278, E-GEOD-14383,
E-GEOD-14390, E-GEOD-14479, E-GEOD-14924, E-GEOD-14926, E-GEOD-14973,
E-GEOD-15271, E-GEOD-15389, E-GEOD-15543, E-GEOD-15645, E-GEOD-15811,
E-GEOD-15947, E-GEOD-16020, E-GEOD-16214, E-GEOD-16237, E-GEOD-16363,
E-GEOD-1643, E-GEOD-16515, E-GEOD-16728, E-GEOD-16797, E-GEOD-16836,
E-GEOD-16837, E-GEOD-17251, E-GEOD-17385, E-GEOD-17400, E-GEOD-17636,
E-GEOD-17743, E-GEOD-17763, E-GEOD-18018, E-GEOD-18791, E-GEOD-18842,
E-GEOD-18913, E-GEOD-18995, E-GEOD-19067, E-GEOD-19293, E-GEOD-19639,
E-GEOD-19665, E-GEOD-19784, E-GEOD-19804, E-GEOD-19826, E-GEOD-19864,
E-GEOD-19982, E-GEOD-20114, E-GEOD-20540, E-GEOD-20948, E-GEOD-21261,
E-GEOD-22152, E-GEOD-22229, E-GEOD-22513, E-GEOD-22544, E-GEOD-22563,
E-GEOD-22779, E-GEOD-23031, E-GEOD-23687, E-GEOD-23806, E-GEOD-2397,
E-GEOD-23984, E-GEOD-25518, E-GEOD-2634, E-GEOD-26495, E-GEOD-26673,
E-GEOD-27034, E-GEOD-2706, E-GEOD-27187, E-GEOD-31193, E-GEOD-32719,
E-GEOD-3467, E-GEOD-34748, E-GEOD-34880, E-GEOD-3526, E-GEOD-35972, E-GEOD-36547,
E-GEOD-3678, E-GEOD-3744, E-GEOD-3998, E-GEOD-4567, E-GEOD-4600, E-GEOD-4655,
E-GEOD-4883, E-GEOD-4888, E-GEOD-5040, E-GEOD-5230, E-GEOD-5264, E-GEOD-5372,
E-GEOD-5679, E-GEOD-6054, E-GEOD-6241, E-GEOD-6400, E-GEOD-6764, E-GEOD-7011,
E-GEOD-7216, E-GEOD-7224, E-GEOD-7392, E-GEOD-7440, E-GEOD-7509, E-GEOD-7515,
E-GEOD-7538, E-GEOD-7568, E-GEOD-7586, E-GEOD-7696, E-GEOD-7708, E-GEOD-7869,
E-GEOD-7890, E-GEOD-8023, E-GEOD-8121, E-GEOD-8167, E-GEOD-8514, E-GEOD-8527,
E-GEOD-8597, E-GEOD-8658, E-GEOD-8823, E-GEOD-8961, E-GEOD-8977, E-GEOD-9171,
E-GEOD-9489, E-GEOD-9517, E-GEOD-9599, E-GEOD-9649, E-GEOD-9692, E-GEOD-9894,
E-MEXP-1103, E-MEXP-1171, E-MEXP-1230, E-MEXP-1243, E-MEXP-1290, E-MEXP-1337,
E-MEXP-1372, E-MEXP-1389, E-MEXP-1403, E-MEXP-1412, E-MEXP-1425, E-MEXP-1482,
E-MEXP-1512, E-MEXP-1599, E-MEXP-1601, E-MEXP-1741, E-MEXP-1838, E-MEXP-1857,
E-MEXP-1956, E-MEXP-1958, E-MEXP-2010, E-MEXP-2034, E-MEXP-2055, E-MEXP-2069,
E-MEXP-2083, E-MEXP-2115, E-MEXP-2128, E-MEXP-2236, E-MEXP-2340, E-MEXP-2351,
E-MEXP-2360, E-MEXP-2375, E-MEXP-2590, E-MEXP-2657, E-MEXP-3479, E-MEXP-3577,
E-MEXP-3756, E-MEXP-3810, E-MEXP-555, E-MEXP-561, E-MEXP-563, E-MEXP-858,
E-MEXP-884, E-MEXP-930, E-MEXP-935, E-MEXP-964, E-MEXP-980,
E-MEXP-987, E-MEXP-993, E-MTAB-1131, E-MTAB-317, E-MTAB-372, E-MTAB-874, 
E-TABM-1020, E-TABM-1029, E-TABM-1138, E-TABM-1208, E-TABM-234, E-TABM-276, 
E-TABM-282, E-TABM-311, E-TABM-440, E-TABM-577, E-TABM-601, E-TABM-666, 
E-TABM-763, E-TABM-898 